Abstract

We suggest a new approach to hypothesis testing for ergodic and stationary
processes. In contrast to standard methods, the suggested approach makes it
possible to build tests based on any lossless data compression method, even
if the distribution law of the codeword lengths is not known. We apply this
approach to the following four problems: goodness-of-fit testing (or identity
testing), testing for independence, testing of serial independence and
homogeneity testing, and suggest nonparametric statistical tests for these problems.
It is important to note that practically used so-called archivers can be used for
the suggested testing.

AMS subject classification: 60G10, 60J10, 62M02, 62M07, 94A29. 
1 Introduction

Since Claude Shannon published his famous paper "A mathematical theory of
communication" [36], the ideas and results of Information Theory have begun to play an
important role in cryptography [2U EH] , mathematical statistics jT| 1201 I2H] , ergodic 
theory |TJ|21EE] and many other fields which are far from telecommunication. 

The theory of universal coding, which is a part of Information Theory, also has been 
efficiently applied to many fields since its discovery in [TO] EE] ■ Thus, application of 
results of universal coding, initiated in [2H|, created a new approach to prediction 

ma unci]. 

In this paper we suggest a new approach to hypothesis testing, which is based 
on ideas of universal coding. We would like to emphasize that, on the one hand, the 
problem of hypothesis testing is considered in the framework of classical mathematical 
statistics and, on the other hand, everyday methods of data compression (or archivers) 
can be used as a tool for testing. It is important to note that the modern archivers are 
based on deep theoretical results of the source coding theory (see, for ex., [5] ITrj] IT9"] 
123 EH]) and have shown their high efficiency in practice as compressors of texts, DNA
sequences and many other types of real data. In fact, universal codes and archivers
can find latent regularities of many kinds, which is why they look like a promising tool
for hypothesis testing.

*Research was supported by the joint project grant "Efficient randomness testing of random and
pseudorandom number generators" of the Royal Society, UK (grant ref: 15995) and the Russian Foundation
for Basic Research (grant no. 03-01-00495).

1.1 The main idea of the suggested approach 

Let us describe the main idea of the suggested approach using one particular problem
of hypothesis testing which is conceptually simple and yet important in practice.
Namely, we consider a null hypothesis $H_0$ that a given bit sequence $x_1 \ldots x_t$ is generated
by a Bernoulli source with equal probabilities of 0 and 1, and the alternative hypothesis
$H_1$ that the sequence is generated by a stationary and ergodic source which differs from
the source under $H_0$. This problem is considered in [32] and is a particular case of
the goodness-of-fit testing (or identity testing) described below, which is why we give
an informal solution only. Let $\varphi$ be a universal code, $\varphi(x_1 \ldots x_t)$ be the encoded
sequence, $l_\varphi(x_1 \ldots x_t)$ be the length of the word $\varphi(x_1 \ldots x_t)$ and $\alpha$ be the required level
of significance. Intuition suggests that the sequence cannot be compressed if $H_0$ is
true and, vice versa, if the sequence can be compressed, $H_0$ should be rejected. The
corresponding formal test is as follows: if $t - l_\varphi(x_1 \ldots x_t) > \log(1/\alpha)$, then $H_0$ should
be rejected. (Here and below $\log = \log_2$.) It will be proven below that the Type
I error of this test is equal to or less than $\alpha$ for any (uniquely decodable) code $\varphi$,
whereas the Type II error goes to 0 for any universal code $\varphi$ when the sequence
length $t$ grows.

Let us look at the described test in more detail. It is well known that the average
codeword length of any code is not less than the sequence length $t$ if $H_0$ is true.
Hence, if we define the codeword length of the best code as $l_{H_0}(x_1 \ldots x_t)$, we can see
that $l_{H_0}(x_1 \ldots x_t) = t$. Now the scheme of the suggested test can be described as
follows: if $l_{H_0}(x_1 \ldots x_t) - l_\varphi(x_1 \ldots x_t) \le \log(1/\alpha)$ then $H_0$, otherwise $H_1$. We will apply
this scheme to all considered statistical problems, sometimes replacing the length
$l_{H_0}(x_1 \ldots x_t)$ with its lower bound (as a rule, such a lower bound will be based on
the so-called empirical Shannon entropy).

1.2 Description of considered problems 

We consider a stationary and ergodic source (or process), which generates elements
from some set (or alphabet) $A$ (which can be either finite or infinite), and four problems
of statistical testing.

The first problem is the goodness-of-fit testing (or identity testing), which is
described as follows: the hypothesis $H_0^{id}$ is that the source has a particular distribution $\pi$
and the alternative hypothesis $H_1^{id}$ is that the sequence is generated by a stationary
and ergodic source which differs from the source under $H_0^{id}$. One particular case, in
which the source alphabet $A$ equals $\{0, 1\}$ and the main hypothesis $H_0^{id}$ is that a bit
sequence is generated by the Bernoulli source with equal probabilities of 0's and 1's,
was described in Section 1.1.

The second problem is a generalization of the problem of nonparametric testing
for serial independence of time series. More precisely, we consider the two following
hypotheses: $H_0^{SI}$ is that the source is Markovian of order not larger than $m$ ($m \ge 0$),
and the alternative hypothesis $H_1^{SI}$ is that the sequence is generated by a stationary
and ergodic source which differs from the source under $H_0^{SI}$. In particular, if $m = 0$,
this is the problem of testing for independence of time series.

The third problem is the independence testing. In this case it is assumed that
the source is Markovian, whose order is not larger than $m$ ($m \ge 0$), and the source
alphabet can be presented as a product of $d$, $d \ge 2$, alphabets $A_1, A_2, \ldots, A_d$ (i.e.
$A = \prod_{i=1}^{d} A_i$). The main hypothesis $H_0^{ind}$ is that

    $p(x_{m+1} = (a_{i_1}, \ldots, a_{i_d}) | x_1 \ldots x_m) = \prod_{j=1}^{d} p(x^{j}_{m+1} = a_{i_j} | x_1 \ldots x_m)$

for each $(a_{i_1}, \ldots, a_{i_d}) \in \prod_{i=1}^{d} A_i$, where $x_{m+1} = (x^{1}_{m+1}, \ldots, x^{d}_{m+1})$.
The alternative hypothesis $H_1^{ind}$ is that the sequence is generated by a
Markovian source of order not larger than $m$ ($m \ge 0$), which differs from the source
under $H_0^{ind}$.

In all three cases the testing should be based either on one sample $x_1 \ldots x_t$ or on
several ($l$) independent samples $x^1 = x^1_1 \ldots x^1_{t_1}, \ldots, x^l = x^l_1 \ldots x^l_{t_l}$ generated by the
source (see footnote 1).

The fourth problem is the homogeneity testing. There are $r$ samples $x^1 = x^1_1 \ldots x^1_{t_1}$,
$x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$ and it is assumed that they are generated by Markovian sources,
whose orders are not larger than $m$ ($m \ge 0$). The main hypothesis $H_0^{hom}$ is that all
samples are generated by one source, whereas the alternative hypothesis $H_1^{hom}$ is that
at least two samples are generated by different sources.

All four problems are well known in mathematical statistics and there is an exten- 
sive literature dealing with their nonparametric testing, see for review, for example, 

[num. 

1.3 Main results 

We suggest statistical tests for all problems such that the Type I error is less than or
equal to a given $\alpha$ and the Type II error goes to zero when the sample size grows.
However, there are some additional restrictions, mainly concerned with the case of
an infinite source alphabet. For this case all tests are described for memoryless (or i.i.d.)
sources only. It is important to note that the suggested tests are based on universal
codes (and closely connected universal predictors), but the Type I error is less than
or equal to a given $\alpha$ for any code and, in particular, for practically used
methods of data compression (or archivers), which is why they can be used as a basis
for the tests.

1.4 Outline of the paper 

The next section contains some necessary facts and definitions. Sections 3
and 4 describe the tests for the cases where the alphabets are
finite and infinite, respectively. Some experimental results and simulation studies are
given in Section 5.

1 For a case of one sample and a finite alphabet A some of these problems were considered by the 
authors in and reports submitted to conferences. 
We give a description of one particular universal code in Appendix 1, because
universal codes play a key role in this paper, but information about them is spread
between numerous papers and they are not widely presented in the statistical literature
(in spite of the fact that universal codes have found different applications to some
classical problems of mathematical statistics, see, for example, [5]). Besides, the universal
code described in Appendix 1 is used for the simulation study of serial independence
testing in Section 5. (On the other hand, this paper focuses on hypothesis testing,
which is why the description of the universal codes and the ideas behind them are put in the
appendix.)

The conclusion is intended to clarify the connection of the suggested approach
with some statistical methods and to briefly describe some possible generalizations of
the described tests. All proofs are given in Appendix 2.



2 Definitions and Auxiliary Results 

2.1 Stochastic processes and the Shannon entropy 

Now we briefly describe stochastic processes (or sources of information). Consider an
alphabet $A$, which can be either finite or infinite, and denote by $A^t$ and $A^*$ the set
of all words of length $t$ over $A$ and the set of all finite words over $A$, respectively
($A^* = \bigcup_{i=1}^{\infty} A^i$). By $M_\infty(A)$ we denote the set of all stationary and ergodic sources
which generate letters from $A$ (for a definition see, for example, [21 E]), and let $M_0(A) \subset
M_\infty(A)$ be the set of all i.i.d. processes. Let $M_m(A) \subset M_\infty(A)$ be the set of Markov
sources of order (or with memory, or connectivity) not larger than $m$, $m \ge 0$. In the
case of a finite alphabet $A$, Markov processes will play a key role in this paper, which
is why we give a formal definition. By definition, $\mu \in M_m(A)$ if

    $\mu(x_{t+1} = a_{i_1} | x_t = a_{i_2}, x_{t-1} = a_{i_3}, \ldots, x_{t-m+1} = a_{i_{m+1}}, \ldots)$
        $= \mu(x_{t+1} = a_{i_1} | x_t = a_{i_2}, x_{t-1} = a_{i_3}, \ldots, x_{t-m+1} = a_{i_{m+1}})$   (1)

for all $t \ge m$ and $a_{i_1}, a_{i_2}, \ldots \in A$. Let $M^*(A) = \bigcup_{i=0}^{\infty} M_i(A)$ be the set of all
finite-order sources.

Let $\tau$ be a stationary and ergodic source generating letters from a finite alphabet
$A$. The $m$-order (conditional) Shannon entropy and the limit Shannon entropy are
defined as follows:

    $h_m(\tau) = -\sum_{v \in A^m} \tau(v) \sum_{a \in A} \tau(a|v) \log \tau(a|v), \quad h_\infty(\tau) = \lim_{m \to \infty} h_m(\tau).$   (2)

It is also known that for any $m$

    $h_\infty(\tau) \le h_m(\tau),$   (3)
see E]. The well known Shannon-MacMillan-Breiman theorem states that 

    $\lim_{t \to \infty} -\log \tau(x_1 \ldots x_t)/t = h_\infty(\tau)$   (4)

with probability 1, see [2, 11].

Let $v = v_1 \ldots v_k$ and $x = x_1 x_2 \ldots x_t$ be words from $A^*$. Denote the rate (the number
of occurrences) of the word $v$ in the sequence of words $x_1 x_2 \ldots x_k$, $x_2 x_3 \ldots x_{k+1}$, ..., $x_{t-k+1} \ldots x_t$
by $\nu^x(v)$. For example, if $x = 000100$ and $v = 00$, then $\nu^x(00) = 3$. For any $0 \le k < t$
the empirical Shannon entropy of order $k$ is defined as follows:

    $h^*_k(x) = -\sum_{v \in A^k} \frac{\bar\nu^x(v)}{t-k} \sum_{a \in A} \frac{\nu^x(va)}{\bar\nu^x(v)} \log \frac{\nu^x(va)}{\bar\nu^x(v)},$   (5)

where $x = x_1 \ldots x_t$ and $\bar\nu^x(v) = \sum_{a \in A} \nu^x(va)$. In particular, if $k = 0$, we obtain
$h^*_0(x) = -t^{-1} \sum_{a \in A} \nu^x(a) \log(\nu^x(a)/t)$.
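The counts $\nu^x(v)$ and the empirical entropy of order $k$ can be computed directly from their definitions. The following is a minimal sketch, assuming the sample is given as a Python string with one character per letter; the function names `count_words` and `empirical_entropy` are hypothetical.

```python
import math
from collections import Counter


def count_words(x: str, n: int) -> Counter:
    """nu^x(w) for every word w of length n, counted over all sliding windows of x."""
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))


def empirical_entropy(x: str, k: int) -> float:
    """Empirical Shannon entropy of order k, in bits per letter."""
    t = len(x)
    nu = count_words(x, k + 1)        # nu^x(va) for words va of length k + 1
    nu_bar = Counter()                # nu_bar^x(v) = sum over a of nu^x(va)
    for w, c in nu.items():
        nu_bar[w[:k]] += c
    h = 0.0
    for w, c in nu.items():
        h -= c / (t - k) * math.log2(c / nu_bar[w[:k]])
    return h


print(count_words("000100", 2)["00"])  # 3, the example from the text
print(empirical_entropy("0101", 0))    # 1.0
```

The inner double sum of the definition collapses to a single sum over the observed words $va$, because the factors $\bar\nu^x(v)$ cancel.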

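We extend these definitions to the case where a sample is presented as several
(independent) sequences $x^1 = x^1_1 \ldots x^1_{t_1}$, $x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$ generated
by a source. (The point is that we cannot simply combine all samples into one if the
source is not i.i.d.) We denote this sample by $x^1 \diamond x^2 \diamond \ldots \diamond x^r$ and define $t = \sum_{i=1}^{r} t_i$,
$\nu^{x^1 \diamond x^2 \diamond \ldots \diamond x^r}(v) = \sum_{i=1}^{r} \nu^{x^i}(v)$. For example, if $x^1 = 0010$, $x^2 = 011$, then $\nu^{x^1 \diamond x^2}(00) = 1$.
Analogously to (5),

    $h^*_k(x^1 \diamond x^2 \diamond \ldots \diamond x^r) = -\sum_{v \in A^k} \frac{\bar\nu^{x^1 \diamond \ldots \diamond x^r}(v)}{t - kr} \sum_{a \in A} \frac{\nu^{x^1 \diamond \ldots \diamond x^r}(va)}{\bar\nu^{x^1 \diamond \ldots \diamond x^r}(v)} \log \frac{\nu^{x^1 \diamond \ldots \diamond x^r}(va)}{\bar\nu^{x^1 \diamond \ldots \diamond x^r}(v)},$   (6)

where $\bar\nu^{x^1 \diamond \ldots \diamond x^r}(v) = \sum_{a \in A} \nu^{x^1 \diamond \ldots \diamond x^r}(va)$.

For any sequence of words $x^1 = x^1_1 \ldots x^1_{t_1}$, $x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$ from
$A^*$ and any measure $\theta$ we define $\theta(x^1 \diamond x^2 \diamond \ldots \diamond x^r) = \prod_{i=1}^{r} \theta(x^i)$.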
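For a sample made of several sequences, the counts are accumulated per sequence so that no window crosses a border between sequences. A minimal sketch, assuming each sequence is a Python string (the names `pooled_counts` and `pooled_empirical_entropy` are hypothetical):

```python
import math
from collections import Counter


def pooled_counts(samples, n):
    """Counts of words of length n, summed over the sequences (windows never cross borders)."""
    counts = Counter()
    for x in samples:
        counts.update(x[i:i + n] for i in range(len(x) - n + 1))
    return counts


def pooled_empirical_entropy(samples, k):
    """Empirical entropy of order k for a sample of r sequences, in bits per letter."""
    t, r = sum(len(x) for x in samples), len(samples)
    nu = pooled_counts(samples, k + 1)
    nu_bar = Counter()
    for w, c in nu.items():
        nu_bar[w[:k]] += c
    return -sum(c / (t - k * r) * math.log2(c / nu_bar[w[:k]]) for w, c in nu.items())


print(pooled_counts(["0010", "011"], 2)["00"])  # 1, the example from the text
```

The normalizing constant $t - kr$ equals the total number of windows of length $k + 1$ when every sequence is longer than $k$.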

We will use the following well known inequality, whose proof can be found in [11]:
for any two probability distributions $p$ and $q$ over some alphabet $B$ the following
inequality

    $\sum_{b \in B} p(b) \log \frac{p(b)}{q(b)} \ge 0$   (7)

is valid, with equality if and only if $p = q$.

The value $\sum_{b \in B} p(b) \log \frac{p(b)}{q(b)}$ is often called the Kullback-Leibler divergence.
The following property of the empirical Shannon entropy will be used later. 



Lemma. Let $\theta$ be a measure from $M_m(A)$, $m \ge 0$, and $x^1, \ldots, x^r$ be words from
$A^*$, whose lengths are not less than $m$. Then

    $\theta(x^1 \diamond \ldots \diamond x^r) \le 2^{-(t - rm)\, h^*_m(x^1 \diamond \ldots \diamond x^r)}.$   (8)



2.2 Codes 

A data compression method (or code) $\varphi$ is defined as a set of mappings $\varphi_n$ such
that $\varphi_n : A^n \to \{0, 1\}^*$, $n = 1, 2, \ldots$, and for each pair of different words $x, y \in A^n$
$\varphi_n(x) \ne \varphi_n(y)$. It is also required that each sequence $\varphi_n(u_1)\varphi_n(u_2) \ldots \varphi_n(u_r)$, $r \ge 1$,
of encoded words from the set $A^n$, $n \ge 1$, can be uniquely decoded into $u_1 u_2 \ldots u_r$.
Such codes are called uniquely decodable. For example, let $A = \{a, b\}$; the code
$\varphi_1(a) = 0$, $\varphi_1(b) = 00$ is obviously not uniquely decodable. It is well known that if a
code $\varphi$ is uniquely decodable then the lengths of the codewords satisfy the following
inequality (Kraft inequality): $\sum_{u \in A^n} 2^{-|\varphi_n(u)|} \le 1$, see, for example, [11]. (Here and below
$|v|$ is the length of $v$ if $v$ is a word, and the number of elements of $v$ if $v$ is a set.) It
will be convenient to reformulate this property as follows:

Let $\varphi$ be a uniquely decodable code over an alphabet $A$. Then for any integer $n$
there exists a measure $\mu_\varphi$ on $A^n$ such that

    $-\log \mu_\varphi(u) \le |\varphi(u)|$   (9)

for any $u$ from $A^n$.

It is easy to see that this is true for the measure $\mu_\varphi(u) = 2^{-|\varphi(u)|} / \sum_{w \in A^n} 2^{-|\varphi(w)|}$. In
what follows we call uniquely decodable codes just "codes".

We suppose that any code is defined for each sequence of words $x^1 \diamond x^2 \diamond \ldots \diamond x^l$.
(For example, any code $\varphi$ can be extended to this case as follows: $\varphi(x^1 \diamond x^2 \diamond \ldots \diamond x^l)
= \varphi(x^1)\varphi(x^2) \ldots \varphi(x^l)$.)

There exist so-called universal codes. To introduce these codes we first recall
that (as is known in Information Theory) sequences $x_1 \ldots x_t$ generated by a
source $p$ can be "compressed" up to the length $-\log p(x_1 \ldots x_t)$ bits; on the other
hand, for any source $p$ there is no code $\varphi$ for which the average codeword length
$\sum_{u \in A^t} p(u)|\varphi(u)|$ is less than $-\sum_{u \in A^t} p(u) \log p(u)$. Universal codes can reach the
lower bound $-\log p(x_1 \ldots x_t)$ asymptotically for any stationary and ergodic source $p$
with probability 1.

A formal definition is as follows: a code $\varphi$ is universal if for any stationary and
ergodic source $p$

    $\lim_{t \to \infty} t^{-1} \left( -\log p(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \right) = 0$   (10)

with probability 1. So, informally speaking, universal codes estimate the probability
characteristics of the source $p$ and use them for efficient "compression". One of the
first universal codes was described in [28], see also [29], and now there are many
efficient universal codes and universal predictors connected with them, see, e.g., [13, 23].

3 Tests For A Finite Alphabet

3.1 Goodness-of-fit testing or identity testing

Now we consider the problem of testing the hypothesis $H_0^{id}$ that the source has a
particular distribution $\pi$, $\pi \in M_\infty(A)$, against $H_1^{id}$ that the source is stationary and
ergodic and differs from $\pi$. Let the required level of significance (or the Type I error)
be $\alpha$, $\alpha \in (0, 1)$. We describe a statistical test which can be constructed based on any
code $\varphi$.

The main idea of the suggested test is quite natural: compress a sample $x$ by
a code $\varphi$. If the length of the codeword $|\varphi(x)|$ is significantly less than the value
$-\log \pi(x)$, then $H_0^{id}$ should be rejected. The key observation is that the probability
of all rejected samples is quite small for any $\varphi$, which is why the Type I error can be
made small. The formal description of the test is as follows:

Let there be a sample $x$ presented by sequences $x^1 = x^1_1 \ldots x^1_{t_1}, \ldots, x^l = x^l_1 \ldots x^l_{t_l}$,
generated independently by a source. The hypothesis $H_0^{id}$ is accepted if

    $-\log \pi(x) - |\varphi(x)| \le -\log \alpha.$   (11)

Otherwise, $H_0^{id}$ is rejected. We denote this test by $T_\varphi^{id}(A, \alpha)$.

Theorem 1. i) For each distribution $\pi$, $\alpha \in (0, 1)$ and a code $\varphi$, the Type I
error of the described test $T_\varphi^{id}(A, \alpha)$ is not larger than $\alpha$ and ii) if, in addition, $\pi$
is a finite-order stationary and ergodic process over $A^\infty$ (i.e. $\pi \in M^*(A)$) and $\varphi$ is a
universal code, then the Type II error of the test $T_\varphi^{id}(A, \alpha)$ goes to 0 as the sample size
$t$ ($t = \sum_{i=1}^{l} t_i$) tends to infinity.
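To make the acceptance rule (11) concrete, here is a minimal sketch for one binary sample and an i.i.d. null distribution $\pi$ with $\pi(0) = p_0$. Everything beyond rule (11) is an assumption made for illustration: instead of a real archiver, the codeword length is taken to be the rounded-up ideal length of an adaptive code driven by the Krichevsky-Trofimov estimator (such lengths satisfy the Kraft inequality), and the function names are hypothetical.

```python
import math


def kt_code_length(x: str, alphabet: str = "01") -> int:
    """Ideal length in bits of an adaptive code based on the Krichevsky-Trofimov estimator."""
    counts = {a: 0 for a in alphabet}
    bits = 0.0
    for i, s in enumerate(x):
        bits -= math.log2((counts[s] + 0.5) / (i + len(alphabet) / 2))
        counts[s] += 1
    return math.ceil(bits)


def neg_log2_pi(x: str, p0: float) -> float:
    """-log pi(x) for an i.i.d. binary source with pi(0) = p0."""
    n0 = x.count("0")
    return -(n0 * math.log2(p0) + (len(x) - n0) * math.log2(1 - p0))


def identity_test_accepts(x: str, p0: float, alpha: float = 0.01) -> bool:
    """Rule (11): accept H0 if -log pi(x) - |phi(x)| <= -log alpha."""
    return neg_log2_pi(x, p0) - kt_code_length(x) <= -math.log2(alpha)


sample = "0" * 900 + "1" * 100
print(identity_test_accepts(sample, p0=0.5))  # False: the sample is far from fair coin tossing
print(identity_test_accepts(sample, p0=0.9))  # True
```

The second call accepts because no i.i.d.-based code can describe the sample in fewer bits than $-\log \pi(x)$ when $\pi$ matches the observed frequencies exactly.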

3.2 Testing of serial independence 

Let there be a sample $x$ presented by sequences $x^1 = x^1_1 \ldots x^1_{t_1}, \ldots, x^l = x^l_1 \ldots x^l_{t_l}$,
generated independently by an (unknown) source, and let $t = \sum_{i=1}^{l} t_i$. The main hypothesis
$H_0^{SI}$ is that the source is Markovian, whose order is not greater than $m$ ($m \ge 0$),
and the alternative hypothesis $H_1^{SI}$ is that the sample $x$ is generated by a stationary
and ergodic source whose order is greater than $m$ (i.e. the source belongs to
$M_\infty(A) \setminus M_m(A)$). The suggested test is as follows.

Let $\varphi$ be any code. By definition, the hypothesis $H_0^{SI}$ is accepted if

    $(t - ml)\, h^*_m(x) - |\varphi(x)| \le \log(1/\alpha),$   (12)

where $\alpha \in (0, 1)$. Otherwise, $H_0^{SI}$ is rejected. We denote this test by $T_\varphi^{SI}(A, \alpha)$.

Theorem 2. i) For any code $\varphi$ the Type I error of the test $T_\varphi^{SI}(A, \alpha)$ is less
than or equal to $\alpha$, $\alpha \in (0, 1)$, and ii) if, in addition, $\varphi$ is a universal code and the
sample size $t$ tends to infinity, then the Type II error goes to 0.
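A minimal sketch of rule (12) for a single sequence ($l = 1$) follows. As an assumption made only for illustration, zlib plays the role of the code $\varphi$ and the sequence is stored with one byte per letter, which is wasteful for a binary alphabet and makes the test conservative; the function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def empirical_entropy(x: str, m: int) -> float:
    """Empirical Shannon entropy of order m, in bits per letter."""
    nu = Counter(x[i:i + m + 1] for i in range(len(x) - m))
    nu_bar = Counter()
    for w, c in nu.items():
        nu_bar[w[:m]] += c
    return -sum(c / (len(x) - m) * math.log2(c / nu_bar[w[:m]]) for w, c in nu.items())


def accepts_markov_order(x: str, m: int, alpha: float = 0.01) -> bool:
    """Rule (12) with l = 1: accept 'order <= m' if (t - m) h*_m(x) - |phi(x)| <= log(1/alpha)."""
    t = len(x)
    code_len = 8 * len(zlib.compress(x.encode("ascii"), 9))  # |phi(x)| in bits (zlib is an assumption)
    return (t - m) * empirical_entropy(x, m) - code_len <= math.log2(1 / alpha)


alternating = "01" * 4096
print(accepts_markov_order(alternating, m=0))  # False: the letters are clearly not independent
print(accepts_markov_order(alternating, m=1))  # True: a first-order Markov source explains the data
```

For the alternating sequence the order-0 empirical entropy is 1 bit per letter while the compressor needs only a few hundred bits, so independence is rejected; at order 1 the empirical entropy is 0 and the hypothesis is accepted.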

3.3 Independence testing 

Now we consider the problem of independence testing for Markovian sources.
It is supposed that the source alphabet $A$ is the Cartesian product of $d$ alphabets
$A_1, \ldots, A_d$, i.e. $A = \prod_{i=1}^{d} A_i$, $d \ge 2$, and it is known a priori that the source belongs
to $M_m(A)$ for some known $m$, $m \ge 0$. We present each letter $x$ as $x = (x^{(1)}, \ldots, x^{(d)})$,
where $x^{(i)} \in A_i$. The hypothesis $H_0^{ind}$ is that $\mu \in M_m(A)$ is such a source that for
each $a = (a^{(1)}, \ldots, a^{(d)}) \in \prod_{i=1}^{d} A_i$ and each $x_1 \ldots x_m \in A^m$ the following equality is
valid:

    $\mu(x_{m+1} = (a^{(1)}, \ldots, a^{(d)}) | x_1 \ldots x_m) = \prod_{i=1}^{d} \mu^{(i)}(x^{(i)}_{m+1} = a^{(i)} | x_1 \ldots x_m),$   (13)

where, by definition,

    $\mu^{(i)}(x^{(i)}_{m+1} = a | x_1 \ldots x_m) = \sum_{(b_1, \ldots, b_{i-1}) \in \prod_{j=1}^{i-1} A_j} \; \sum_{(b_{i+1}, \ldots, b_d) \in \prod_{j=i+1}^{d} A_j} \mu(x_{m+1} = (b_1, \ldots, b_{i-1}, a, b_{i+1}, \ldots, b_d) | x_1 \ldots x_m).$   (14)

The hypothesis $H_1^{ind}$ is that equation (13) is not valid for at least one $(a^{(1)}, \ldots, a^{(d)})
\in \prod_{i=1}^{d} A_i$ and $x_1 \ldots x_m \in A^m$.

Let us describe the test for hypotheses $H_0^{ind}$ and $H_1^{ind}$. Suppose that there is
a sample $x$ presented as sequences $x^1 = x^1_1 \ldots x^1_{t_1}, \ldots, x^l = x^l_1 \ldots x^l_{t_l}$, generated
independently by a source, where, in turn, any $x^i_j = (x^{i,1}_j, \ldots, x^{i,d}_j)$. We define
$t = \sum_{i=1}^{l} t_i$ and $x^{(k)} = x^{1,k}_1 \ldots x^{1,k}_{t_1} \diamond \ldots \diamond x^{l,k}_1 \ldots x^{l,k}_{t_l}$ for $k = 1, 2, \ldots, d$.
Let $\varphi$ be any code. By definition, the hypothesis $H_0^{ind}$ is accepted if

    $\sum_{k=1}^{d} (t - ml)\, h^*_m(x^{(k)}) - |\varphi(x)| \le \log(1/\alpha),$   (15)

$\alpha \in (0, 1)$. Otherwise, $H_0^{ind}$ is rejected. We denote this test by $T_\varphi^{ind}(A, \alpha)$. First we
give an informal explanation of the main idea of the test. The Shannon entropy is
the lower bound of the compression ratio and the empirical entropy $h^*_m(x^{(k)})$ is its
estimate. So, if $H_0^{ind}$ is true, the sum $\sum_{k=1}^{d} (t - ml)\, h^*_m(x^{(k)})$ is, on average, close to
the lower bound. Hence, if the length of a codeword of some code $\varphi$ is significantly
less than the sum of the empirical entropies, it means that there is some dependence
between components, which is used for some additional compression. The following
theorem describes the properties of the suggested test.

Theorem 3. i) For any code $\varphi$ the Type I error of the test $T_\varphi^{ind}(A, \alpha)$ is less than
or equal to $\alpha$, $\alpha \in (0, 1)$, and ii) if, in addition, $\varphi$ is a universal code and $t$ tends to
infinity, then the Type II error of the test $T_\varphi^{ind}(A, \alpha)$ goes to 0.
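Rule (15) can be sketched for the simplest setting $d = 2$, $m = 0$, $l = 1$ with binary components. The choices below are assumptions for illustration only: the pair sequence is coded by an ideal adaptive code based on the Krichevsky-Trofimov estimator over the product alphabet (a stand-in for an archiver), and all function names are hypothetical.

```python
import math
from collections import Counter


def h0(x) -> float:
    """Order-0 empirical entropy in bits per letter."""
    t = len(x)
    return -sum(c / t * math.log2(c / t) for c in Counter(x).values())


def kt_code_length(x, alphabet_size: int) -> int:
    """Ideal length in bits of an adaptive Krichevsky-Trofimov code over the given alphabet size."""
    counts, bits = Counter(), 0.0
    for i, s in enumerate(x):
        bits -= math.log2((counts[s] + 0.5) / (i + alphabet_size / 2))
        counts[s] += 1
    return math.ceil(bits)


def independence_test_accepts(x1: str, x2: str, sizes=(2, 2), alpha: float = 0.01) -> bool:
    """Rule (15) with d = 2, m = 0, l = 1: sum of t*h0 over components minus |phi(x)|."""
    t = len(x1)
    joint = list(zip(x1, x2))                       # letters of the product alphabet
    code_len = kt_code_length(joint, sizes[0] * sizes[1])
    return t * h0(x1) + t * h0(x2) - code_len <= math.log2(1 / alpha)


print(independence_test_accepts("0011" * 500, "0101" * 500))  # True: all four pairs equally frequent
print(independence_test_accepts("0110" * 500, "0110" * 500))  # False: second component copies the first
```

In the second call the joint sequence costs about one bit per pair, while the two component entropies add up to two bits per pair, and this gap is what triggers the rejection.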

3.4 Homogeneity testing 

Let there be $r$ samples $x^1 = x^1_1 \ldots x^1_{t_1}$, $x^2 = x^2_1 \ldots x^2_{t_2}$, ..., $x^r = x^r_1 \ldots x^r_{t_r}$ ($r \ge 2$), and
it is assumed that they are generated by Markovian sources whose orders are not
larger than $m$ ($m \ge 0$), where $m$ is known a priori (i.e. the sources belong to $M_m(A)$).
The null hypothesis $H_0^{hom}$ is that all samples are generated by one source, whereas the
alternative hypothesis $H_1^{hom}$ is that at least two samples are generated by different
sources.

Let us describe the test for hypotheses $H_0^{hom}$ and $H_1^{hom}$. Let $\varphi$ be any code,
$t = \sum_{i=1}^{r} t_i$ and $\alpha \in (0, 1)$. By definition, the hypothesis $H_0^{hom}$ is accepted if

    $(t - mr)\, h^*_m(x^1 \diamond x^2 \diamond \ldots \diamond x^r) - \sum_{i=1}^{r} |\varphi(x^i)| \le \log(1/\alpha).$   (16)

Otherwise, $H_0^{hom}$ is rejected. We denote this test by $T_\varphi^{hom}(A, \alpha)$.

Theorem 4. i) For any code $\varphi$ the Type I error of the test $T_\varphi^{hom}(A, \alpha)$ is less
than or equal to $\alpha$, $\alpha \in (0, 1)$, and ii) if, in addition, $\varphi$ is a universal code and the
sample size $t$ goes to infinity in such a way that there exists a positive constant $c$ for
which

    $c < t_j / t$   (17)

for each $j$, then the Type II error of the test $T_\varphi^{hom}(A, \alpha)$ goes to 0.

Let us give some comments concerning the constant c. In fact, the existence of such 
a constant means that all samples are present and grow. Otherwise, some samples 
could have a negligible length, say, 1 letter and, obviously, it would be difficult to 
build a reasonable test for such a case. 

The suggested test can be extended to the case where it is known beforehand that
some sequences (from $x^1, x^2, \ldots, x^r$) were generated by the same source. In this case
the same test can be applied, but condition ii) can be weakened as follows: for each
source the inequality (17) must be valid for at least one sample.
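A minimal sketch of rule (16) for memoryless sources ($m = 0$) is given below. As an assumption made for illustration, each sample is coded separately by an ideal adaptive code based on the Krichevsky-Trofimov estimator, which stands in for an archiver; the function names are hypothetical.

```python
import math
from collections import Counter


def kt_code_length(x: str, alphabet_size: int = 2) -> int:
    """Ideal length in bits of an adaptive Krichevsky-Trofimov code."""
    counts, bits = Counter(), 0.0
    for i, s in enumerate(x):
        bits -= math.log2((counts[s] + 0.5) / (i + alphabet_size / 2))
        counts[s] += 1
    return math.ceil(bits)


def homogeneity_test_accepts(samples, alpha: float = 0.01) -> bool:
    """Rule (16) with m = 0: t * h*_0(pooled sample) - sum of |phi(x^i)| <= log(1/alpha)."""
    pooled = Counter()
    for x in samples:
        pooled.update(x)
    t = sum(len(x) for x in samples)
    h_pooled = -sum(c / t * math.log2(c / t) for c in pooled.values())
    code_len = sum(kt_code_length(x) for x in samples)
    return t * h_pooled - code_len <= math.log2(1 / alpha)


same = ["01" * 500, "10" * 500]
different = ["0" * 900 + "1" * 100, "1" * 900 + "0" * 100]
print(homogeneity_test_accepts(same))       # True
print(homogeneity_test_accepts(different))  # False
```

In the second example the pooled sample looks like fair coin tossing (1 bit per letter), yet each sample taken alone is compressed to roughly half of its length, which is only possible if the samples come from different sources.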

4 Infinite Alphabet 

In this part we consider the case where the source alphabet $A$ is infinite, say, a
part of $R^n$. Our strategy is to use finite partitions of $A$ and to consider hypotheses
corresponding to the partitions. The main problem of this approach is as follows: if
someone combines letters (or states) of a Markov chain, the chain order (or memory)
can increase. For example, if an alphabet contains three letters, there exists a Markov
chain of order one such that combining two letters into one subset transforms the chain
into a process with infinite memory. On the other hand, the main part of the results
described above is valid for finite-order processes. That is why in this part we will
consider i.i.d. processes only (i.e. processes from $M_0(A)$).

In order to avoid numerous repetitions, we will consider a general scheme, which
can be applied to all tests, using the notations $H_0^{\aleph}$, $H_1^{\aleph}$ and $T_\varphi^{\aleph}(A, \alpha)$, where $\aleph$ is an
abbreviation of one of the described tests (i.e. id, SI, ind and hom).

Let us give some definitions. Let $\Lambda = \lambda_1, \ldots, \lambda_s$ be a finite (measurable) partition
of $A$ and let $\Lambda(x)$ be the element of the partition $\Lambda$ which contains $x \in A$. For any
process $\pi$ we define a process $\pi_\Lambda$ over a new alphabet $\Lambda$ by the equation

    $\pi_\Lambda(\lambda_{i_1} \ldots \lambda_{i_k}) = \pi(x_1 \in \lambda_{i_1}, \ldots, x_k \in \lambda_{i_k}),$

where $x_1 \ldots x_k \in A^k$. (Such partitions are widely used in information theory; see, for
ex., [ni[7l[TT] for a detailed description.)

We will consider an infinite sequence of partitions $\hat\Lambda = \Lambda_1, \Lambda_2, \ldots$ and say that such
a sequence discriminates between a pair of hypotheses $H_0^{\aleph}(A)$, $H_1^{\aleph}(A)$ about processes
from $M_0(A)$ if for each process $\varrho$ for which $H_1^{\aleph}(A)$ is true there exists a partition $\Lambda_j$
for which $H_1^{\aleph}(\Lambda_j)$ is true for the process $\varrho_{\Lambda_j}$. We also define a probability distribution
$\{\omega = \omega_1, \omega_2, \ldots\}$ on the integers $\{1, 2, \ldots\}$ by

    $\omega_1 = 1 - 1/\log 3, \; \ldots, \; \omega_i = 1/\log(i + 1) - 1/\log(i + 2), \; \ldots$   (18)

(In what follows we will use this distribution, but the theorem described below is
obviously true for any distribution with nonzero probabilities.)

Let $H_0^{\aleph}(A)$, $H_1^{\aleph}(A)$ be a pair of hypotheses, $\hat\Lambda = \Lambda_1, \Lambda_2, \ldots$ be a sequence of
partitions, $\alpha$ be from $(0, 1)$ and $\varphi$ be a code. The scheme for all the tests is as follows:
the hypothesis $H_0^{\aleph}(A)$ is accepted if for all $i = 1, 2, 3, \ldots$ the test $T_\varphi^{\aleph}(\Lambda_i, \alpha\omega_i)$
accepts the hypothesis $H_0^{\aleph}(\Lambda_i)$. Otherwise, $H_0^{\aleph}(A)$ is rejected. We denote this test as
$T_{\alpha,\varphi}^{\aleph}(\hat\Lambda)$.

Comment. It is important to note that one does not need to check an infinite
number of inequalities when one applies this test. The point is that the hypothesis
$H_0^{\aleph}(\Lambda_i)$ has to be accepted if the left part in (11), (12), (15) and (16), correspondingly,
is less than $-\log(\alpha\omega_i)$. Obviously, $-\log(\alpha\omega_i)$ goes to infinity if $i$ increases. That is
why there are many cases where it is enough to check a finite number of hypotheses $H_0^{\aleph}(\Lambda_i)$.

Theorem 5. i) For each $\alpha \in (0, 1)$, sequence of partitions $\hat\Lambda$ and a code $\varphi$, the
Type I error of the described test $T_{\alpha,\varphi}^{\aleph}(\hat\Lambda)$ is not larger than $\alpha$ and ii) if, in addition,
$\varphi$ is a universal code and $\hat\Lambda$ discriminates between $H_0^{\aleph}(A)$, $H_1^{\aleph}(A)$, then the Type II
error of the test $T_{\alpha,\varphi}^{\aleph}(\hat\Lambda)$ goes to 0 when the sample size tends to infinity (in the case
of the homogeneity testing, in addition, the inequality (17) should be valid).
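The weights (18) and the resulting thresholds $-\log(\alpha\omega_i)$ are easy to tabulate. The sketch below (with the hypothetical function names `omega` and `threshold`) also checks numerically that the partial sums telescope to $1 - 1/\log(n + 2)$, so that the weights indeed form a probability distribution.

```python
import math


def omega(i: int) -> float:
    """Weight of the i-th partition from (18); log is taken to base 2."""
    return 1 / math.log2(i + 1) - 1 / math.log2(i + 2)


def threshold(alpha: float, i: int) -> float:
    """Right-hand side -log(alpha * omega_i) used for the i-th partition."""
    return -math.log2(alpha * omega(i))


n = 1000
partial_sum = sum(omega(i) for i in range(1, n + 1))
print(abs(partial_sum - (1 - 1 / math.log2(n + 2))) < 1e-9)  # True: the sum telescopes

for i in (1, 10, 100):
    print(i, threshold(0.01, i))  # thresholds grow with i
```

Since the thresholds grow without bound, only partitions with a small index can lead to a rejection for a sample of a given size, which is the content of the comment above.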

5 The Experiments 

In this part we describe results of some experiments and a simulation study carried
out to estimate the efficiency of the suggested tests. The obtained results show
that the described tests, as well as the suggested approach in general, can be used in
applications.

5.1 Randomness testing

First we consider the problem of randomness testing, which is a particular case of
goodness-of-fit testing. Namely, we will consider a null hypothesis $H_0^{id}$ that a given bit
sequence is generated by a Bernoulli source with equal probabilities of 0 and 1 and the
alternative hypothesis $H_1^{id}$ that the sequence is generated by a stationary and ergodic
source which differs from the source under $H_0^{id}$. This problem is important for testing
random number generators (RNG) and pseudorandom number generators (PRNG), and there are
many methods for randomness testing suggested in the literature. Thus, recently the National
Institute of Standards and Technology (NIST, USA) suggested "A statistical test suite
for random and pseudorandom number generators for cryptographic applications", see [27].

We investigated linear congruential generators (LCG), which are defined by the
following equality

    $X_{n+1} = (A \cdot X_n + C) \bmod M,$

where $X_n$ is the $n$-th generated number. Each such generator we will denote by
LCG($M, A, C, X_0$), where $X_0$ is the initial value of the generator. Such generators
are well studied and many of them are used in practice, see [17].

In our experiments we extract an eight-bit word from each generated $X_i$ using the
following algorithm. Firstly, the number $\mu = \lfloor M/256 \rfloor$ was calculated and then each
$X_i$ was transformed into an 8-bit word $\hat X_i$ as follows:

    $\hat X_i = \lfloor X_i/\mu \rfloor$   if $X_i < 256\mu$,
    $\hat X_i$ = empty word   if $X_i \ge 256\mu$.   (19)

Then a sequence was compressed by the archiver ACE v 1.2b (see http://www.winace.com).
Experimental data about testing of three linear congruential generators is given in
Table 1.



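Table 1: Pseudorandom number generators testing

parameters M, A, C, X_0        | length 400 000 bits | length 8 000 000 bits
-------------------------------+---------------------+----------------------
10^8 + 1, 23, 0, 47594118      | 390 240             | 7635936
2^31, 2^16 + 3, 0, 1           | extended            | 7797984
2^32, 134775813, 1, 0          | extended            | extended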

So, we can see from the first line of the table that the 400000-bit sequence
generated by LCG($10^8 + 1$, 23, 0, 47594118) and transformed according to (19)
was compressed to a 390240-bit sequence. (Here 400000 is the length of the sequence
after transformation.) If we take the level of significance $\alpha \ge 2^{-9760}$ and apply the test
$T_\varphi^{id}(\{0, 1\}, \alpha)$ ($\varphi$ = ACE v 1.2b), the hypothesis $H_0^{id}$ should be rejected, see Theorem
1 and (11). Analogously, the second line of the table shows that the 8000000-bit
sequence generated by LCG($2^{31}$, $2^{16} + 3$, 0, 1) cannot be considered random ($H_0^{id}$ should
be rejected if the level of significance $\alpha$ is greater than $2^{-202016}$). On the other hand,
the suggested test accepts $H_0^{id}$ for the sequences generated by the third generator,
because the lengths of the "compressed" sequences increased.

The obtained information corresponds to the known data about the considered
generators. Thus, the first two generators have been shown to be bad, whereas
the third generator was investigated in [22] and is regarded as good. So, we can see
that the suggested testing is quite efficient.
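The arithmetic behind these statements follows directly from rule (11): with $-\log \pi(x) = t$, the hypothesis is rejected whenever $t - |\varphi(x)| > \log(1/\alpha)$, i.e. for every $\alpha > 2^{-(t - |\varphi(x)|)}$. The short sketch below recomputes the exponents from the lengths reported in Table 1; the function name `rejection_exponent` is hypothetical.

```python
def rejection_exponent(t_bits: int, compressed_bits: int) -> int:
    """Return k = t - |phi(x)|; H0 is rejected for every alpha > 2**(-k)."""
    return t_bits - compressed_bits


# (original length, compressed length) in bits, as reported in Table 1.
table_1 = {
    "LCG(10^8 + 1, 23, 0, 47594118), 400000 bits": (400_000, 390_240),
    "LCG(10^8 + 1, 23, 0, 47594118), 8000000 bits": (8_000_000, 7_635_936),
    "LCG(2^31, 2^16 + 3, 0, 1), 8000000 bits": (8_000_000, 7_797_984),
}

for name, (t, c) in table_1.items():
    print(f"{name}: reject H0 for alpha > 2^-{rejection_exponent(t, c)}")
```

The entries marked "extended" in Table 1 correspond to a negative difference, for which the hypothesis is accepted at every level $\alpha \in (0, 1)$.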

In a recently published paper [32] the described method was applied for testing
random number and pseudorandom number generators and its efficiency was compared
with the mentioned methods from "A statistical test suite for random and
pseudorandom number generators for cryptographic applications" [27]. The point is
that the tests from [27] are selected based on comprehensive theoretical and experimental
analysis and can be considered as the state of the art in randomness testing.
It turned out that the suggested tests, which were based on the archivers RAR and ARJ,
were more powerful than many methods recommended by NIST in [27]; see [32] for
details.



5.2 Simulation study of serial independence testing 

A selection of the simulation results concerning independence tests is presented in
this part. We generated binary sequences by a first-order Markov source with
different probabilities (see Table 2 below) and applied the test $T_\varphi^{SI}(\{0, 1\}, \alpha)$ to test
the hypothesis $H_0^{SI}$ that a given bit sequence is generated by a Bernoulli source against
the alternative hypothesis $H_1^{SI}$ that the sequence is generated by a stationary and
ergodic source which differs from the source under $H_0^{SI}$.

We tried several different archivers and the universal code R described in Appendix
1. It turned out that the power of the code R is larger than the power of the tried
archivers, which is why we present results for the test $T_R^{SI}(\{0, 1\}, \alpha)$, which is based on
this code, for $\alpha = 0.01$. Table 2 contains the results of the calculations.

Table 2: Serial independence testing for a first-order Markov source ("rej" means
rejected, "acc" means accepted). In all cases p(x_{i+1} = 0 | x_i = 1) = 0.5.

probabilities / length (bits)       | 2^9 | 2^14 | 2^16 | 2^18 | 2^23
------------------------------------+-----+------+------+------+-----
p(x_{i+1} = 0 | x_i = 0) = 0.8      | rej | rej  | rej  | rej  | rej
p(x_{i+1} = 0 | x_i = 0) = 0.6      | acc | rej  | rej  | rej  | rej
p(x_{i+1} = 0 | x_i = 0) = 0.55     | acc | acc  | rej  | rej  | rej
p(x_{i+1} = 0 | x_i = 0) = 0.525    | acc | acc  | acc  | rej  | rej
p(x_{i+1} = 0 | x_i = 0) = 0.505    | acc | acc  | acc  | acc  | rej

We know that the source is Markovian and, hence, the hypothesis $H_0^{SI}$ (that a
sequence is generated by a Bernoulli source) is not true. The table shows how the value
of the Type II error depends on the sample size and the source probabilities.

Similar calculations were carried out for a Markov source of order 6. We
applied the test $T_R^{SI}(\{0, 1\}, \alpha)$, $\alpha = 0.01$, for checking the hypothesis $H_0^{SI}$ that a given
bit sequence is generated by a Markov source of order at most 5 against the alternative
hypothesis $H_1^{SI}$ that the sequence is generated by a stationary and ergodic source
which differs from the source under $H_0^{SI}$. Again, we know that $H_0^{SI}$ is not true, and
Table 3 shows how the value of the Type II error depends on the sample size and
the source probabilities.

Table 3: Serial independence testing for a Markov source of order 6. In all cases
p(x_{i+1} = 0 | (\sum_{j=i-6}^{i} x_j) mod 2 = 1) = 0.5.

probabilities / length (bits)                               | 2^14 | 2^18 | 2^20 | 2^23 | 2^28
------------------------------------------------------------+------+------+------+------+-----
p(x_{i+1} = 0 | (\sum_{j=i-6}^{i} x_j) mod 2 = 0) = 0.8     | rej  | rej  | rej  | rej  | rej
p(x_{i+1} = 0 | (\sum_{j=i-6}^{i} x_j) mod 2 = 0) = 0.6     | acc  | rej  | rej  | rej  | rej
p(x_{i+1} = 0 | (\sum_{j=i-6}^{i} x_j) mod 2 = 0) = 0.55    | acc  | acc  | rej  | rej  | rej
p(x_{i+1} = 0 | (\sum_{j=i-6}^{i} x_j) mod 2 = 0) = 0.525   | acc  | acc  | acc  | rej  | rej
p(x_{i+1} = 0 | (\sum_{j=i-6}^{i} x_j) mod 2 = 0) = 0.505   | acc  | acc  | acc  | acc  | rej

6 Conclusion. 

In this part we point out some generalizations of the suggested approach as well as 
clarify the connection with some statistical methods. 

Taking into account the Kraft inequality, we can rewrite the goodness-of-fit test as follows:

$$\text{if } \pi(x)/\mu_\varphi(x) > \alpha \text{ then } H_0, \text{ otherwise } H_1, \tag{20}$$

where, as before, $\mu_\varphi(x) = 2^{-|\varphi(x)|} / \sum_{u \in A^t} 2^{-|\varphi(u)|}$ and $t$ is the sample size. Clearly, (20) looks like the likelihood ratio test, which is one of the main statistical tools. Moreover, all other tests can be presented in the same manner. Thus, if we denote $2^{-(t-m) h^*_m(x)}$ from (12) by $\pi$, we can rewrite the serial independence test (12) in the same form as (20). The same is true for the independence test and the homogeneity test, if we denote by $\pi$ the values $2^{-\sum_{i=1}^{d} (t-m) h^*_m(x^i)}$ and $2^{-(t - rm)\, h^*_m(x^1 \diamond \dots \diamond x^r)}$, correspondingly.

Now we use the representation (20) in order to extend the suggested tests to the following more general case. Let there be several codes (or archivers) $\varphi_1, \varphi_2, \dots, \varphi_Z$, and suppose we want to build a test which is based on all of them. In order to get such a test, we define the "mixture" probability distribution and the mixture distribution of codeword lengths by the equalities

$$\mu_{mix}(x) = (2^{-|\varphi_1(x)|} + 2^{-|\varphi_2(x)|} + \dots + 2^{-|\varphi_Z(x)|})/Z, \qquad |\varphi_{mix}(x)| = -\log \mu_{mix}(x),$$

correspondingly. Obviously, the Kraft inequality is valid for $|\varphi_{mix}|$ and, therefore, $|\varphi_{mix}|$ can be used in all suggested tests instead of $|\varphi|$. In the case when the set of codes $\varphi_1, \varphi_2, \dots$ is infinite, we can use some probability distribution $\tau$ on the set $1, 2, 3, \dots$ and define $\mu_{mix}$ and $|\varphi_{mix}|$ as follows:

$$\mu_{mix}(x) = \sum_{i=1}^{\infty} \tau_i\, 2^{-|\varphi_i(x)|}, \qquad |\varphi_{mix}(x)| = -\log \mu_{mix}(x). \tag{21}$$

(For example, the distribution $\omega$ (18) can be used here as the distribution $\tau$.)
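
The finite mixture can be sketched in a few lines. The choice of the three standard-library compressors as the codes $\varphi_1, \varphi_2, \varphi_3$ and the function names below are our assumptions, made only for illustration; the compressed size in bits is taken as the codeword length.

```python
import bz2
import lzma
import math
import zlib

# Hypothetical stand-ins for the codes phi_1, ..., phi_Z.
CODES = [zlib.compress, bz2.compress, lzma.compress]

def code_lengths_bits(data):
    """Codeword lengths |phi_i(x)| in bits, one per compressor."""
    return [8 * len(compress(data)) for compress in CODES]

def mixture_length_bits(data):
    """|phi_mix(x)| = -log2((sum_i 2^-|phi_i(x)|) / Z), computed stably."""
    lengths = code_lengths_bits(data)
    shortest = min(lengths)
    # Factor out 2^-shortest so that the remaining powers cannot overflow;
    # very small terms simply underflow to 0.0, which is harmless here.
    tail = sum(2.0 ** -(length - shortest) for length in lengths)
    return shortest + math.log2(len(lengths)) - math.log2(tail)
```

By construction the mixture length exceeds the shortest of the individual lengths by at most $\log Z$ bits, so little is lost by not knowing in advance which archiver suits the data best.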

It can be easily seen from the descriptions of the tests that their power is greater if the codeword length $|\varphi(x)|$ is smaller. That is why it is natural to look for a code $\varphi$ whose length is minimal. First of all, we can find a code $\varphi_s$ such that

$$-\log(\tau_s 2^{-|\varphi_s(x)|}) = \min_i \left(-\log(\tau_i 2^{-|\varphi_i(x)|})\right). \tag{22}$$


Taking into account (21), we can see that

$$|\varphi_{mix}(x)| \le -\log(\tau_s 2^{-|\varphi_s(x)|}),$$

because the sum in (21) is not less than any of its terms. If we denote by $\varphi_{min}$ the code whose codeword length is $|\varphi_{min}(x)| = -\log(\tau_s 2^{-|\varphi_s(x)|})$ for each $x$, the latter inequality shows that $|\varphi_{mix}(x)| \le |\varphi_{min}(x)|$ for any sample $x$ and, hence, the power of the tests based on the code $\varphi_{mix}$ is not less than the power of the tests based on the code $\varphi_{min}$.

It is worth noting that the codes $\varphi_{mix}$ and $\varphi_{min}$ (and the corresponding distributions, which are based on the Kraft inequality) were applied for constructing optimal universal codes and predictors in [23], and later both constructions were used in mathematical statistics and related fields under different names (aggregating strategy, weighted majority algorithms, etc.).

One of the reasons for the popularity of both constructions is their asymptotic optimality. Thus, in the case of hypothesis testing, the codes $\varphi_{mix}$ and $\varphi_{min}$ give, in a certain sense, the most powerful (asymptotically) tests. Indeed, if we suppose that the family of codes $\varphi_1, \varphi_2, \dots$ contains a code $\varphi_{opt}$ whose codeword length $|\varphi_{opt}(x)|$ is minimal (say, with probability 1, when the sample size increases), we can see from the definitions of $\varphi_{mix}$ and $\varphi_{min}$ that $|\varphi_{mix}(x)| \le |\varphi_{opt}(x)| + const$ and $|\varphi_{min}(x)| \le |\varphi_{opt}(x)| + const$, where $const = -\log \tau_{opt}$. On the other hand, for any process whose entropy is larger than zero, the codeword length $|\varphi_{opt}(x)|$ goes to infinity as the sample size $|x|$ increases and, hence, the impact of $const$ decreases.

7 Appendix 1. Predictors and Universal Codes 

Let a source generate a message $x_1 \dots x_{t-1} x_t \dots$, $x_i \in A$ for all $i$. After the first $t$ letters $x_1, \dots, x_{t-1}, x_t$ have been processed, the following letter $x_{t+1}$ needs to be predicted. By definition, the prediction is the set of non-negative numbers $\gamma(a_1|x_1 \cdots x_t), \dots, \gamma(a_n|x_1 \cdots x_t)$ which are estimates of the unknown conditional probabilities $p(a_1|x_1 \cdots x_t), \dots, p(a_n|x_1 \cdots x_t)$, i.e. of the probabilities $p(x_{t+1} = a_i|x_1 \cdots x_t)$, $i = 1, \dots, n$. Laplace suggested the following predictor:

$$L_0(a|x_1 \cdots x_t) = (\nu_{x_1 \cdots x_t}(a) + 1)/(t + |A|). \tag{23}$$

(We use $L_0$ here in order to show that it is intended to predict sources from $M_0(A)$. Later this predictor will be extended to $M_i(A)$, $i > 0$.) For example, if $A = \{0,1\}$ and $x_1 \dots x_5 = 01010$, then the Laplace prediction is as follows: $L_0(x_6 = 0|01010) = (3+1)/(5+2) = 4/7$, $L_0(x_6 = 1|01010) = (2+1)/(5+2) = 3/7$.
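
The worked example can be reproduced with exact rational arithmetic. The function name `laplace_predict` and its signature are hypothetical; the sketch only restates (23).

```python
from collections import Counter
from fractions import Fraction

def laplace_predict(history, letter, alphabet="01"):
    """L_0(letter | history) = (occurrences of letter in history + 1) / (t + |A|)."""
    counts = Counter(history)
    return Fraction(counts[letter] + 1, len(history) + len(alphabet))

p_zero = laplace_predict("01010", "0")  # Fraction(4, 7)
p_one = laplace_predict("01010", "1")   # Fraction(3, 7)
```

Adding 1 to every count keeps all predicted probabilities strictly positive, so the K-L error considered below is always finite.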

It is natural to estimate the error of prediction by the Kullback-Leibler (K-L) divergence between a distribution $p$ and its estimate. Consider a source $p$ and a predictor $\gamma$. The error is characterized by the divergence

$$\rho_{\gamma,p}(x_1 \cdots x_t) = \sum_{a \in A} p(a|x_1 \cdots x_t) \log \frac{p(a|x_1 \cdots x_t)}{\gamma(a|x_1 \cdots x_t)}. \tag{24}$$

As we mentioned above, for any distributions $p$ and $\gamma$ the K-L divergence is nonnegative and equals 0 if and only if $p(x) = \gamma(x)$ for all $x$. For fixed $t$, $\rho_{\gamma,p}$ is a random variable, because $x_1, x_2, \dots, x_t$ are random variables. We define the average error at time $t$ by

$$\rho^t(p\|\gamma) = E(\rho_{\gamma,p}(\cdot)) = \sum_{x_1 \cdots x_t \in A^t} p(x_1 \cdots x_t)\, \rho_{\gamma,p}(x_1 \cdots x_t). \tag{25}$$

It is known that the error of the Laplace predictor goes to 0 for any i.i.d. source $p$. More precisely, it is proven that

$$\rho^t(p\|L_0) < (|A| - 1)/(t + 1) \tag{26}$$

for any such source $p$.

For any predictor $\gamma$ we define the corresponding probability measure by

$$\gamma(x_1 \dots x_t) = \prod_{i=1}^{t} \gamma(x_i | x_1 \dots x_{i-1}). \tag{27}$$

For example, the Laplace measure $L_0$ of the word $x_1 \dots x_4 = 0101$ is as follows: $L_0(0101) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{2}{5} = \frac{1}{30}$. By analogy with (24) and (25) we define

$$\rho_{\gamma,p}(x_1 \dots x_t) = t^{-1} \log(p(x_1 \dots x_t)/\gamma(x_1 \dots x_t)) \tag{28}$$

and

$$\rho_t(\gamma, p) = t^{-1} \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t) \log(p(x_1 \dots x_t)/\gamma(x_1 \dots x_t)). \tag{29}$$

For example, from these definitions and (26) we obtain the following estimate for the Laplace predictor $L_0$ and any i.i.d. source $p$:

$$\rho_t(L_0, p) < ((|A| - 1) \log t + c)/t, \tag{30}$$

where $c$ is a constant.

The average error (29) has three interesting characteristics. Firstly, it can be easily seen from (24), (25) and (29) that $\rho_t(\gamma, p)$ is the average error of the predictor $\gamma$ when it is applied to the process $p$:

$$\rho_t(\gamma, p) = t^{-1} \sum_{i=1}^{t} \rho^{i-1}(p\|\gamma).$$

Secondly, taking into account the definition of the Shannon entropy, we can easily see that for $p \in M_0(A)$

$$\rho_t(\gamma, p) = t^{-1} E_p(-\log \gamma(x_1 \dots x_t)) - h_0(p). \tag{31}$$

The third characteristic is connected with the theory of universal coding. One can construct a code with codelength $\gamma_{code}(a|x_1 \cdots x_t) \approx -\log_2 \gamma(a|x_1 \cdots x_t)$ for any letter $a \in A$ (since Shannon's original research, it has been well known that, using block codes with large block length or more modern methods of arithmetic coding, the approximation may be as accurate as you like). If one knows the real distribution $p$, one can base coding on the true distribution $p$ and not on the prediction $\gamma$. The difference in performance measured by average code length is given by

$$\sum_{a \in A} p(a|x_1 \cdots x_t)(-\log_2 \gamma(a|x_1 \cdots x_t)) - \sum_{a \in A} p(a|x_1 \cdots x_t)(-\log_2 p(a|x_1 \cdots x_t)) = \sum_{a \in A} p(a|x_1 \cdots x_t) \log_2 \frac{p(a|x_1 \cdots x_t)}{\gamma(a|x_1 \cdots x_t)}.$$

Thus this excess is exactly the error (24) defined above. Analogously, if we encode the sequence $x_1 \dots x_t$ based on a predictor $\gamma$, the redundancy per letter is defined by (28) and (29). So, from the mathematical point of view, universal prediction and universal coding are identical. But $-\log \gamma(x_1 \dots x_t)$ and $-\log p(x_1 \dots x_t)$ have a very natural interpretation. The first value is the codeword length (in bits) if the "code" $\gamma$ is applied for compressing the word $x_1 \dots x_t$, and the second one is the minimal possible codeword length. The difference is the redundancy of the code and, at the same time, the error of the predictor. It is worth noting that there are many other deep interrelations between universal coding, prediction and estimation; see [23].

As we saw in (30), the average error of the Laplace predictor is upper bounded by $(|A| - 1)(\log t + O(1))/(t + 1)$ when $t$ grows. Krichevsky suggested the predictor

$$K_0(a|x_1 \cdots x_t) = (\nu_{x_1 \cdots x_t}(a) + 1/2)/(t + |A|/2)$$

and showed that the error of this predictor is asymptotically smaller: $\rho_t(K_0, p)$ is upper bounded by $(|A| - 1)(\log t + O(1))/(2t)$. Moreover, he showed that this predictor is asymptotically optimal in the sense that for any other predictor $\gamma$ there exists a source $p$ for which the error $\rho_t(\gamma, p)$ is not less than $(|A| - 1)(\log t + O(1))/(2t)$.

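From definitions (23) and (27) we can see that the Laplace predictor ascribes the following probabilities:

$$L_0(x_1 \dots x_t) = \prod_{i=1}^{t} \frac{\nu_{x_1 \dots x_{i-1}}(x_i) + 1}{i - 1 + |A|} = \frac{\prod_{a \in A} \nu_{x_1 \dots x_t}(a)!}{(t + |A| - 1)!/(|A| - 1)!}. \tag{32}$$

Analogously, for $K_0$ we obtain

$$K_0(x_1 \dots x_t) = \prod_{i=1}^{t} \frac{\nu_{x_1 \dots x_{i-1}}(x_i) + 1/2}{i - 1 + |A|/2} = \frac{\prod_{a \in A} \prod_{j=1}^{\nu_{x_1 \dots x_t}(a)} (j - 1/2)}{\prod_{i=0}^{t-1} (i + |A|/2)}. \tag{33}$$

The following simple example shows the difference between the predictors: if $A = \{0,1\}$ and $x_1 \dots x_4 = 0101$, then $L_0$ and $K_0$ ascribe the probabilities $\frac{1}{2} \cdot \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{2}{5} = \frac{1}{30}$ and $\frac{1}{2} \cdot \frac{1}{4} \cdot \frac{1}{2} \cdot \frac{3}{8} = \frac{3}{128}$, correspondingly.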
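
Both probabilities of the example can be checked by multiplying the one-step predictions. The sketch below uses exact fractions; the function names and the shared parameter `add` (1 for Laplace, 1/2 for Krichevsky) are our own illustrative choices.

```python
from fractions import Fraction

def sequence_probability(word, alphabet="01", add=Fraction(1)):
    """Product over i of (count of x_i so far + add) / (i - 1 + |A| * add)."""
    counts = {a: 0 for a in alphabet}
    prob = Fraction(1)
    for seen, letter in enumerate(word):  # seen = i - 1 letters already processed
        prob *= (counts[letter] + add) / (seen + len(alphabet) * add)
        counts[letter] += 1
    return prob

def laplace_measure(word, alphabet="01"):
    return sequence_probability(word, alphabet, Fraction(1))

def krichevsky_measure(word, alphabet="01"):
    return sequence_probability(word, alphabet, Fraction(1, 2))

l0 = laplace_measure("0101")     # Fraction(1, 30)
k0 = krichevsky_measure("0101")  # Fraction(3, 128)
```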

The product $(r + 1/2)((r + 1) + 1/2) \cdots (s - 1/2)$ can be presented as the ratio $\Gamma(s + 1/2)/\Gamma(r + 1/2)$, where $\Gamma(\cdot)$ is the gamma function. So, (33) can be presented as follows:

$$K_0(x_1 \dots x_t) = \frac{\Gamma(|A|/2) \prod_{a \in A} \Gamma(\nu_{x_1 \dots x_t}(a) + 1/2)}{\Gamma(1/2)^{|A|}\, \Gamma(t + |A|/2)}. \tag{34}$$

As we mentioned above, the average error of the Krichevsky predictor is asymptotically minimal. That is why we will focus our attention on this predictor and, for the sake of completeness, we prove an upper bound for its error.
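
For long samples the gamma-function form of $K_0$ is convenient numerically, because the logarithm of the probability can be computed without forming the tiny product itself. A minimal sketch, assuming Python's `math.lgamma` (natural logarithm of the gamma function); the function name is hypothetical.

```python
import math
from collections import Counter

def log2_krichevsky_closed_form(word, alphabet="01"):
    """log2 K_0(word) computed from the gamma-function representation."""
    counts = Counter(word)
    t, size = len(word), len(alphabet)
    ln_prob = (math.lgamma(size / 2)
               + sum(math.lgamma(counts[a] + 0.5) for a in alphabet)
               - size * math.lgamma(0.5)
               - math.lgamma(t + size / 2))
    return ln_prob / math.log(2)  # convert natural logarithm to bits

# For the word 0101 this agrees with log2(3/128) up to rounding.
value = log2_krichevsky_closed_form("0101")
```

The negated value is the length, in bits, of the codeword that an ideal arithmetic coder driven by $K_0$ would assign to the word.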

Claim 1. For any stationary and ergodic source $p$ generating letters from a finite alphabet $A$ the average error of $K_0$ is upper bounded as follows:

$$-t^{-1} \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t) \log(K_0(x_1 \dots x_t)) - h_0(p) \le ((|A| - 1) \log t + C)/(2t),$$

where $C$ is a constant.

The proof is given in Appendix 2.

Comment. In particular, if the source is i.i.d., the average error is less than $((|A| - 1) \log t + C)/(2t)$.

We indicated that extensions of both predictors to cover the general Markov case are possible. We take this up now. The trick is to view a Markov source $p \in M_m(A)$ as resulting from $|A|^m$ i.i.d. sources. We illustrate this idea by an example. Assume that $A = \{0,1\}$, $m = 2$ and that the source $p \in M_2(A)$ has generated the sequence

0 0 1 0 1 1 0 0 1 1 1 0 1 0.

We represent this sequence by the following four subsequences:

* * 1 * * * * * 1 * * * * *
* * * 0 * 1 * * * 1 * * * 0
* * * * 1 * * 0 * * * * 1 *
* * * * * * 0 * * * 1 0 * *

These four subsequences contain the letters which follow 00, 01, 10 and 11, respectively. By definition, $p \in M_m(A)$ if $p(a|x_1 \cdots x_t) = p(a|x_{t-m+1} \cdots x_t)$ for all $0 < m \le t$, all $a \in A$ and all $x_1 \cdots x_t \in A^t$. Therefore, each of the four generated subsequences may be considered to be generated by a Bernoulli source. Further, it is possible to reconstruct the original sequence if we know the four ($= |A|^m$) subsequences and the two ($= m$) first letters of the original sequence.
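
The decomposition into subsequences and the reconstruction of the original sequence can be sketched as follows. Both function names are hypothetical and serve only to illustrate the example above.

```python
from collections import defaultdict

def split_by_context(sequence, m):
    """Group each letter x_i (i > m) under the m letters that precede it."""
    subsequences = defaultdict(str)
    for i in range(m, len(sequence)):
        subsequences[sequence[i - m:i]] += sequence[i]
    return dict(subsequences)

def merge_from_contexts(prefix, subsequences, length):
    """Rebuild the sequence from its first m letters and the subsequences."""
    m = len(prefix)
    cursor = {context: 0 for context in subsequences}
    rebuilt = prefix
    while len(rebuilt) < length:
        context = rebuilt[len(rebuilt) - m:]  # last m letters written so far
        rebuilt += subsequences[context][cursor[context]]
        cursor[context] += 1
    return rebuilt

parts = split_by_context("00101100111010", 2)
# parts == {"00": "11", "01": "0110", "10": "101", "11": "010"}
```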

Any predictor $\gamma$ for i.i.d. sources can be applied to Markov sources. Indeed, in order to predict, it is enough to store in memory $|A|^m$ sequences, one corresponding to each word in $A^m$. Thus, in the example, the letter $x_3$ which follows 00 is predicted based on the Bernoulli method $\gamma$ corresponding to the $x_1 x_2$-subsequence ($= 00$), then $x_4$ is predicted based on the Bernoulli method corresponding to $x_2 x_3$, i.e. to the 01-subsequence, and so forth. When this scheme is applied along with either $L_0$ or $K_0$, we denote the obtained predictors as $L_m$ and $K_m$, correspondingly, and define the probabilities for the first $m$ letters as follows: $L_m(x_1) = L_m(x_2) = \dots = L_m(x_m) = 1/|A|$, $K_m(x_1) = K_m(x_2) = \dots = K_m(x_m) = 1/|A|$.
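
A direct, non-optimized sketch of the measure $K_m$ follows: the first $m$ letters get probability $1/|A|$ each, and every later letter is predicted by the Krichevsky rule applied to the counts collected for its own context. The function names are our assumptions.

```python
from fractions import Fraction
from itertools import product

def krichevsky_markov_measure(word, m, alphabet="01"):
    """K_m(word), computed letter by letter with exact fractions."""
    size = len(alphabet)
    half = Fraction(1, 2)
    counts = {}  # counts[context][letter]: occurrences of letter after context
    prob = Fraction(1)
    for i, letter in enumerate(word):
        if i < m:
            prob *= Fraction(1, size)  # first m letters: uniform
            continue
        seen = counts.setdefault(word[i - m:i], {a: 0 for a in alphabet})
        total = sum(seen.values())
        prob *= (seen[letter] + half) / (total + size * half)
        seen[letter] += 1
    return prob

def total_mass(t, m, alphabet="01"):
    """Sum of K_m over all words of length t; equals 1 for a valid measure."""
    return sum(krichevsky_markov_measure("".join(w), m, alphabet)
               for w in product(alphabet, repeat=t))
```

For $m = 0$ the function reduces to $K_0$, and for words no longer than $m$ it returns $|A|^{-t}$.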

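Taking into account (32) and (34), we can present the Laplace and Krichevsky predictors for $M_m(A)$ as follows:

$$L_m(x_1 \dots x_t) = \begin{cases} \dfrac{1}{|A|^t}, & \text{if } t \le m; \\[2mm] \dfrac{1}{|A|^m} \displaystyle\prod_{v \in A^m} \dfrac{\prod_{a \in A} \nu_x(va)!}{(\bar\nu_x(v) + |A| - 1)!/(|A| - 1)!}, & \text{if } t > m; \end{cases} \tag{35}$$

$$K_m(x_1 \dots x_t) = \begin{cases} \dfrac{1}{|A|^t}, & \text{if } t \le m; \\[2mm] \dfrac{1}{|A|^m} \displaystyle\prod_{v \in A^m} \dfrac{\Gamma(|A|/2) \prod_{a \in A} \Gamma(\nu_x(va) + 1/2)}{\Gamma(1/2)^{|A|}\, \Gamma(\bar\nu_x(v) + |A|/2)}, & \text{if } t > m, \end{cases} \tag{36}$$

where $\bar\nu_x(v) = \sum_{a \in A} \nu_x(va)$, $x = x_1 \dots x_t$.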

We have seen that any source from $M_m(A)$ can be presented as a "sum" of $|A|^m$ i.i.d. sources. From this we can easily see that the error of a predictor for a source from $M_m(A)$ can be upper bounded by the error for an i.i.d. source multiplied by $|A|^m$. In particular, we obtain from Claim 1 the following upper bound.

Claim 2. For any stationary and ergodic source $p$ generating letters from a finite alphabet $A$ the average error of the Krichevsky predictor $K_m$ is upper bounded as follows:

$$-t^{-1} \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t) \log(K_m(x_1 \dots x_t)) - h_m(p) \le |A|^m ((|A| - 1) \log t + C)/(2t),$$

where $C$ is a constant.

Now we can describe the universal predictor $R$ and the code $R_{code}$ from [23]. By definition,

$$R(x_1 \dots x_t) = \sum_{i=0}^{\infty} \omega_{i+1} K_i(x_1 \dots x_t),$$

$$R(x_t | x_1 \dots x_{t-1}) = R(x_1 \dots x_t)/R(x_1 \dots x_{t-1})$$

and $|R_{code}(x_1 \dots x_t)| = -\log R(x_1 \dots x_t)$, where $\omega$ is the distribution (18). It is worth noting that this construction can be applied to the Laplace predictor (if we use $L_i$ instead of $K_i$) and to any other family of predictors (or codes).
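
The mixture $R$ can be evaluated exactly for a finite word, because every $K_i$ with $i \ge t$ ascribes the same probability $|A|^{-t}$ to a word of length $t$. The sketch below rests on two assumptions of ours that should be checked against the definitions given earlier in the paper: that the weights are $\omega_i = 1/\log_2(i+1) - 1/\log_2(i+2)$, and that $K_i$ enters the sum with weight $\omega_{i+1}$. All function names are hypothetical.

```python
import math
from itertools import product

def k_measure(word, m, alphabet="01"):
    """K_m(word) in floating point (uniform on the first m letters)."""
    size = len(alphabet)
    counts = {}
    prob = 1.0
    for i, letter in enumerate(word):
        if i < m:
            prob /= size
            continue
        seen = counts.setdefault(word[i - m:i], {a: 0 for a in alphabet})
        prob *= (seen[letter] + 0.5) / (sum(seen.values()) + size / 2)
        seen[letter] += 1
    return prob

def omega(i):
    """Assumed weights on i = 1, 2, ...; they telescope and sum to 1."""
    return 1 / math.log2(i + 1) - 1 / math.log2(i + 2)

def r_measure(word, alphabet="01"):
    """R(word) = sum over i >= 0 of omega(i + 1) * K_i(word)."""
    t = len(word)
    head = sum(omega(i + 1) * k_measure(word, i, alphabet) for i in range(t))
    # For i >= t, K_i(word) = |A|^-t; the remaining weights sum to 1/log2(t + 2).
    tail = (1 / math.log2(t + 2)) * len(alphabet) ** -t
    return head + tail

def r_code_length(word, alphabet="01"):
    """|R_code(word)| = -log2 R(word), in bits."""
    return -math.log2(r_measure(word, alphabet))
```

Since $R$ is a convex combination of probability measures, its values over all words of a fixed length sum to 1, which gives a simple sanity check of an implementation.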

Claim 3. Let the predictor $R$ be applied to a source $p$. Then, for any stationary and ergodic source $p \in M_\infty(A)$ the error of the predictor $R$ goes to 0 when the sample size $t$ goes to $\infty$.

The proof can be derived from Claim 2 and the properties of the Shannon entropy. Indeed, we can see from the definition of $R$ and Claim 2 that the average error is upper bounded as follows:

$$-t^{-1} \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t) \log(R(x_1 \dots x_t)) - h_k(p) \le (|A|^k (|A| - 1) \log t + \log(1/\omega_{k+1}) + C)/(2t),$$

for any $k = 0, 1, 2, \dots$. Taking into account that for any $p \in M_\infty(A)$ $\lim_{k \to \infty} h_k(p) = h_\infty(p)$, we can see that

$$\lim_{t \to \infty} \Big( -t^{-1} \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t) \log(R(x_1 \dots x_t)) - h_\infty(p) \Big) = 0.$$

The main property of the universal codes (10) is also true for $R_{code}$ and can be easily derived from Claim 3 using standard techniques of ergodic theory.

8 Appendix 2. Proofs 

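Proof of the Lemma. First we show that for any source $\theta^* \in M_0(A)$ and any words $x^1 = x^1_1 \dots x^1_{t_1}, \dots, x^r = x^r_1 \dots x^r_{t_r}$,

$$\theta^*(x^1 \diamond \dots \diamond x^r) = \prod_{a \in A} (\theta^*(a))^{\nu_{x^1 \diamond \dots \diamond x^r}(a)} \le \prod_{a \in A} \left(\nu_{x^1 \diamond \dots \diamond x^r}(a)/t\right)^{\nu_{x^1 \diamond \dots \diamond x^r}(a)}, \tag{37}$$

where $t = \sum_{i=1}^{r} t_i$. Here the equality holds because $\theta^* \in M_0(A)$. The inequality follows from the nonnegativity of the K-L divergence. Indeed, if $p(a) = \nu_{x^1 \diamond \dots \diamond x^r}(a)/t$ and $q(a) = \theta^*(a)$, then

$$\sum_{a \in A} \frac{\nu_{x^1 \diamond \dots \diamond x^r}(a)}{t} \log \frac{\nu_{x^1 \diamond \dots \diamond x^r}(a)/t}{\theta^*(a)} \ge 0.$$

From the latter inequality we obtain (37). Taking into account the definition of the empirical entropy and (37), we can see that the statement of the Lemma is true for this particular case.

For any $\theta \in M_m(A)$ and $x = x_1 \dots x_s$, $s > m$, we present $\theta(x_1 \dots x_s)$ as $\theta(x_1 \dots x_s) = \theta(x_1 \dots x_m) \prod_{u \in A^m} \prod_{a \in A} \theta(a|u)^{\nu_x(ua)}$, where $\theta(x_1 \dots x_m)$ is the limit probability of the word $x_1 \dots x_m$. Hence, $\theta(x_1 \dots x_s) \le \prod_{u \in A^m} \prod_{a \in A} \theta(a|u)^{\nu_x(ua)}$. Taking into account the inequality (37), we obtain $\prod_{a \in A} \theta(a|u)^{\nu_x(ua)} \le \prod_{a \in A} (\nu_x(ua)/\bar\nu_x(u))^{\nu_x(ua)}$ for any word $u$. Hence,

$$\theta(x_1 \dots x_s) \le \prod_{u \in A^m} \prod_{a \in A} \theta(a|u)^{\nu_x(ua)} \le \prod_{u \in A^m} \prod_{a \in A} (\nu_x(ua)/\bar\nu_x(u))^{\nu_x(ua)}.$$

If we apply those inequalities to $\theta(x^1 \diamond \dots \diamond x^r)$, we immediately obtain the following inequalities:

$$\theta(x^1 \diamond \dots \diamond x^r) \le \prod_{u \in A^m} \prod_{a \in A} \theta(a|u)^{\nu_{x^1 \diamond \dots \diamond x^r}(ua)} \le \prod_{u \in A^m} \prod_{a \in A} \left(\nu_{x^1 \diamond \dots \diamond x^r}(ua)/\bar\nu_{x^1 \diamond \dots \diamond x^r}(u)\right)^{\nu_{x^1 \diamond \dots \diamond x^r}(ua)}.$$

Now the statement of the Lemma follows from the definition of the empirical entropy.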

Proof of Theorem 1. In order to avoid cumbersome notation we first consider the case where the sample $x$ is one sequence $x_1 \dots x_t$ and then note how the proof can be extended to the general case. Let $C_\alpha$ be the critical set of the test $T^{id}_\varphi(A, \alpha)$, i.e., by definition, $C_\alpha = \{u : u \in A^t \ \&\ -\log \pi(u) - |\varphi(u)| > -\log \alpha\}$. Let $\mu_\varphi$ be a measure for which $-\log \mu_\varphi(u) \le |\varphi(u)|$ for every $u$ (such a measure exists due to the Kraft inequality). We define an auxiliary set $\hat{C}_\alpha = \{u : -\log \pi(u) - (-\log \mu_\varphi(u)) > -\log \alpha\}$. We have $1 \ge \sum_{u \in \hat{C}_\alpha} \mu_\varphi(u) \ge \sum_{u \in \hat{C}_\alpha} \pi(u)/\alpha = (1/\alpha)\, \pi(\hat{C}_\alpha)$. (Here the second inequality follows from the definition of $\hat{C}_\alpha$, whereas all others are obvious.) So, we obtain that $\pi(\hat{C}_\alpha) \le \alpha$. From the definitions of $C_\alpha$, $\hat{C}_\alpha$ and the choice of $\mu_\varphi$ we immediately obtain that $\hat{C}_\alpha \supset C_\alpha$. Thus, $\pi(C_\alpha) \le \alpha$. By definition, $\pi(C_\alpha)$ is the value of the Type I error. The first statement of Theorem 1 is proven.
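
The critical set above translates into a very short decision rule. The following sketch is only an illustration under our own assumptions: the hypothesis $H_0$ is that the bits are i.i.d. and equiprobable, so that $-\log_2 \pi(x) = t$, and the `zlib` compressor from the Python standard library plays the role of the code $\varphi$, its output size in bits being taken as $|\varphi(x)|$. The function name is hypothetical.

```python
import math
import random
import zlib

def goodness_of_fit_rejects(data, alpha=0.01):
    """Reject H0 (fair i.i.d. bits) when -log2 pi(x) - |phi(x)| > -log2 alpha."""
    t_bits = 8 * len(data)                       # -log2 pi(x) for pi(x) = 2^-t
    code_bits = 8 * len(zlib.compress(data, 9))  # |phi(x)| with zlib as phi
    return t_bits - code_bits > -math.log2(alpha)

rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(4096))  # looks like fair coin flips
zeros = bytes(4096)                                     # highly regular sample
```

Here `goodness_of_fit_rejects(noise)` is expected to be false, because the archiver cannot shorten the data, while `goodness_of_fit_rejects(zeros)` is true, because the compressed file is thousands of bits shorter than the sample.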

Let us prove the second statement of the theorem. Suppose that the hypothesis $H_1^{id}(A)$ is true. That is, the sequence $x_1 \dots x_t$ is generated by some stationary and ergodic source $\tau$ and $\tau \ne \pi$. Our strategy is to show that

$$\lim_{t \to \infty} \left(-\log \pi(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)|\right) = \infty \tag{38}$$

with probability 1 (according to the measure $\tau$). First we represent the left-hand side of (38) as

$$-\log \pi(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)| = t \left( \frac{1}{t} \log \frac{\tau(x_1 \dots x_t)}{\pi(x_1 \dots x_t)} + \frac{1}{t} \left(-\log \tau(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)|\right) \right).$$

From this equality and the property of a universal code (10) we obtain

$$-\log \pi(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)| = t \left( \frac{1}{t} \log \frac{\tau(x_1 \dots x_t)}{\pi(x_1 \dots x_t)} + o(1) \right). \tag{39}$$

From the properties of the Shannon entropy we can see that

$$\lim_{t \to \infty} -\log \tau(x_1 \dots x_t)/t \le h_k(\tau) \tag{40}$$

for any $k \ge 0$ (with probability 1). It is supposed that the process $\pi$ has a finite memory, i.e. belongs to $M_s(A)$ for some $s$. Taking into account the definition of $M_s(A)$ (1), we obtain the following representation:

$$-\log \pi(x_1 \dots x_t)/t = -t^{-1} \sum_{i=1}^{t} \log \pi(x_i | x_1 \dots x_{i-1}) = -t^{-1} \Big( \sum_{i=1}^{k} \log \pi(x_i | x_1 \dots x_{i-1}) + \sum_{i=k+1}^{t} \log \pi(x_i | x_{i-k} \dots x_{i-1}) \Big)$$

for any $k \ge s$. According to the ergodic theorem there exists a limit

$$\lim_{t \to \infty} t^{-1} \sum_{i=k+1}^{t} \log \pi(x_i | x_{i-k} \dots x_{i-1}),$$

which is equal to the expectation of $\log \pi(x_{k+1} | x_1 \dots x_k)$ with respect to the measure $\tau$. So, from the two latter equalities we can see that

$$\lim_{t \to \infty} (-\log \pi(x_1 \dots x_t))/t = -\sum_{v \in A^k} \tau(v) \sum_{a \in A} \tau(a|v) \log \pi(a|v).$$

Taking into account this equality, (40) and (39) we can see that

$$-\log \pi(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)| \ge t \Big( \sum_{v \in A^k} \tau(v) \sum_{a \in A} \tau(a|v) \log(\tau(a|v)/\pi(a|v)) \Big) + o(t)$$

for any $k \ge s$. From this inequality and (7) we can obtain that $-\log \pi(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)| \ge ct + o(t)$, where $c$ is a positive constant, $t \to \infty$. Hence, (38) is true.

Let us consider the case where $x$ is a sequence of samples $x^1 = x^1_1 \dots x^1_{t_1}, \dots, x^r = x^r_1 \dots x^r_{t_r}$ (i.e. $x = x^1 \diamond \dots \diamond x^r$). The proof of the first statement of the theorem is analogous and can be simply repeated for this case. In order to prove the second statement we note that the length of at least one sequence $x^i$ goes to infinity and, hence, the equality (38) is true for that sequence, whereas for all other sequences the differences $-\log \pi(x^j) - |\varphi(x^j)|$ are either bounded or go to infinity. The theorem is proven.

Proof of Theorem 2.

We only consider the case where the sample $x$ is one sequence $x_1 \dots x_t$, because the general case is analogous, but requires cumbersome notation. Let us denote the critical set of the test $T^{SI}_\varphi(A, \alpha)$ as $C_\alpha$, i.e., by definition, $C_\alpha = \{x_1 \dots x_t : (t - m)\, h^*_m(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)| > \log(1/\alpha)\}$. From the Kraft inequality we can see that there exists a measure $\mu_\varphi$ such that $-\log \mu_\varphi(x_1 \dots x_t) \le |\varphi(x_1 \dots x_t)|$. We also define

$$\hat{C}_\alpha = \{x_1 \dots x_t : (t - m)\, h^*_m(x_1 \dots x_t) - (-\log \mu_\varphi(x_1 \dots x_t)) > \log(1/\alpha)\}. \tag{41}$$

Obviously, $\hat{C}_\alpha \supset C_\alpha$. Let $\theta$ be any source from $M_m(A)$. The following chain of equalities and inequalities is true:

$$1 \ge \mu_\varphi(\hat{C}_\alpha) = \sum_{x_1 \dots x_t \in \hat{C}_\alpha} \mu_\varphi(x_1 \dots x_t) \ge \alpha^{-1} \sum_{x_1 \dots x_t \in \hat{C}_\alpha} 2^{-(t-m) h^*_m(x_1 \dots x_t)} \ge \alpha^{-1} \sum_{x_1 \dots x_t \in \hat{C}_\alpha} \theta(x_1 \dots x_t) = \alpha^{-1}\, \theta(\hat{C}_\alpha).$$

(Here both equalities and the first inequality are obvious, the second and the third inequalities follow from (41) and the Lemma, correspondingly.) So, we obtain that $\theta(\hat{C}_\alpha) \le \alpha$ for any source $\theta \in M_m(A)$. Taking into account that $\hat{C}_\alpha \supset C_\alpha$, where $C_\alpha$ is the critical set of the test, we can see that the probability of the Type I error is not greater than $\alpha$. The first statement of the theorem is proven.
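
The serial independence test defined by this critical set can be sketched as follows. The formula used for the empirical entropy $h^*_m$ (conditional entropy of the observed frequencies of words of length $m+1$) is our reading of the definition given earlier in the paper, and `zlib` applied to the bit-packed sample is only a stand-in for the code $\varphi$; all names are hypothetical.

```python
import math
import random
import zlib
from collections import Counter

def empirical_entropy(bits, m):
    """h*_m(x): empirical conditional entropy (bits per letter), order m."""
    t = len(bits)
    word_counts = Counter(bits[i - m:i + 1] for i in range(m, t))  # nu_x(va)
    context_counts = Counter()                                     # nu-bar_x(v)
    for word, n in word_counts.items():
        context_counts[word[:m]] += n
    return -sum(n * math.log2(n / context_counts[word[:m]])
                for word, n in word_counts.items()) / (t - m)

def pack(bits):
    """Pack a string of '0'/'1' characters into bytes before compression."""
    return int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")

def serial_independence_rejects(bits, m=0, alpha=0.01):
    """Reject 'memory at most m' when (t-m) h*_m(x) - |phi(x)| > log2(1/alpha)."""
    code_bits = 8 * len(zlib.compress(pack(bits), 9))
    return (len(bits) - m) * empirical_entropy(bits, m) - code_bits > math.log2(1 / alpha)

def random_bits(n, seed=1):
    rng = random.Random(seed)
    return "".join(rng.choice("01") for _ in range(n))

periodic = "01" * 8192  # deterministic alternation: memory 1, not i.i.d.
```

For the alternating sample the hypothesis of independence ($m = 0$) is rejected, whereas the hypothesis "memory at most 1" is not, since $h^*_1$ is zero there; a pseudo-random sample is not rejected for $m = 0$.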

The proof of the second statement will be based on some results of Information Theory. We obtain from (10) and (3) that for any stationary and ergodic $p$

$$\lim_{t \to \infty} t^{-1} |\varphi(x_1 \dots x_t)| = h_\infty(p) \tag{42}$$

with probability 1. It can be seen from its definition that $h^*_m$ is an estimate for the $m$-order Shannon entropy (2). Applying the ergodic theorem we obtain $\lim_{t \to \infty} h^*_m(x_1 \dots x_t) = h_m(p)$ with probability 1. It is known in Information Theory that $h_m(\varrho) - h_\infty(\varrho) > 0$ if $\varrho$ belongs to $M_\infty(A) \setminus M_m(A)$. It is supposed that $H_1^{SI}(A)$ is true, i.e. the considered process belongs to $M_\infty(A) \setminus M_m(A)$. So, from (42) and the last equality we obtain that $\lim_{t \to \infty} ((t - m)\, h^*_m(x_1 \dots x_t) - |\varphi(x_1 \dots x_t)|) = \infty$. This proves the second statement of the theorem.

Proof of Theorem 3. As before, we only consider the case where the sample $x$ is one sequence $x_1 \dots x_t$, because the general case is analogous. Let $C_\alpha$ be the critical set of the test, i.e., by definition, $C_\alpha = \{(x_1, \dots, x_t) : \sum_{i=1}^{d} (t - m)\, h^*_m(x^i_1 \dots x^i_t) - |\varphi(x_1 \dots x_t)| > \log(1/\alpha)\}$. There exists a measure $\mu_\varphi$ for which the Kraft inequality is valid. Hence,

$$C_\alpha \subset C^*_\alpha = \Big\{(x_1, \dots, x_t) : \sum_{i=1}^{d} (t - m)\, h^*_m(x^i_1 \dots x^i_t) - \log(1/\mu_\varphi(x_1, \dots, x_t)) > \log(1/\alpha)\Big\}. \tag{43}$$

Let $\theta$ be any measure from $M_m(A)$. Then

$$1 \ge \mu_\varphi(C^*_\alpha) \ge \alpha^{-1} \sum_{x_1, \dots, x_t \in C^*_\alpha} \prod_{i=1}^{d} 2^{-(t-m) h^*_m(x^i_1 \dots x^i_t)}.$$

Taking into account the Lemma, we obtain

$$1 \ge \mu_\varphi(C^*_\alpha) \ge \alpha^{-1} \sum_{x_1, \dots, x_t \in C^*_\alpha} \prod_{i=1}^{d} \theta^i(x^i_1 \dots x^i_t).$$

It is supposed that $H_0^{ind}$ is true and, hence, (13) is valid. So, from the latter inequalities we can see that $1 \ge \mu_\varphi(C^*_\alpha) \ge \alpha^{-1} \sum_{x_1, \dots, x_t \in C^*_\alpha} \theta(x_1, \dots, x_t)$. Taking into account that $\sum_{x_1, \dots, x_t \in C^*_\alpha} \theta(x_1, \dots, x_t) = \theta(C^*_\alpha)$ and (43), we obtain that $\theta(C_\alpha) \le \alpha$. So, the first statement of the theorem is proven.

We give a short scheme of the proof of the second statement of the theorem, because it is based on well-known facts of Information Theory. It is known that $h_m(\mu) - \sum_{i=1}^{d} h_m(\mu^i) = 0$ if $H_0^{ind}$ is true and that this difference is negative under $H_1^{ind}$. A universal code compresses a sequence up to $t\, h_m(\mu)$ bits. (Informally, it uses dependence for better compression.) That is why the difference $t \sum_{i=1}^{d} h_m(\mu^i) - t\, h_m(\mu)$ goes to infinity when $t$ increases and, hence, $H_0^{ind}$ will be rejected.

Proof of Theorem 4. For short, we consider the case of two samples and i.i.d. sources (i.e. $m = 0$), because the generalization is obvious. So, there are two samples $x^1 = x^1_1 \dots x^1_{t_1}$ and $x^2 = x^2_1 \dots x^2_{t_2}$ generated by sources from $M_0(A)$. As before, let $C_\alpha$ be the critical set of the test, i.e., by definition, $C_\alpha = \{(x^1, x^2) : (t_1 + t_2)\, h^*_0(x^1 \diamond x^2) - (|\varphi(x^1)| + |\varphi(x^2)|) > \log(1/\alpha)\}$. There exists a measure $\mu_\varphi$ for which the Kraft inequality is valid. So, $C_\alpha \subset C^*_\alpha = \{(x^1, x^2) : (t_1 + t_2)\, h^*_0(x^1 \diamond x^2) - (\log(1/\mu_\varphi(x^1)) + \log(1/\mu_\varphi(x^2))) > \log(1/\alpha)\}$. Let us suppose that $H_0^{hom}$ is true. It means that $(x^1, x^2)$ are created by some source $\theta \in M_0(A)$. Taking into account the definition of the set $C^*_\alpha$ and the Lemma, we obtain the following chain of inequalities:

$$1 \ge \mu_\varphi(C^*_\alpha) = \sum_{(x^1, x^2) \in C^*_\alpha} \mu_\varphi(x^1)\, \mu_\varphi(x^2) \ge \alpha^{-1} \sum_{(x^1, x^2) \in C^*_\alpha} 2^{-(t_1 + t_2) h^*_0(x^1 \diamond x^2)} \ge \alpha^{-1} \sum_{(x^1, x^2) \in C^*_\alpha} \theta(x^1 \diamond x^2) = \alpha^{-1}\, \theta(C^*_\alpha).$$

Hence, $\theta(C^*_\alpha) \le \alpha$ and, taking into account that the critical set $C_\alpha \subset C^*_\alpha$, we finish the proof of the first statement of the theorem.

Let us suppose that $H_1^{hom}$ is true, i.e. the samples $x^1, x^2$ are generated by different sources $\theta_1, \theta_2$, correspondingly. For any $\gamma \in (0, 1)$ we define $\theta_\gamma = \gamma \theta_1 + (1 - \gamma) \theta_2$ and let

$$\delta = \inf_{\gamma \in [c, 1-c]} \left( h_0(\theta_\gamma) - (\gamma\, h_0(\theta_1) + (1 - \gamma)\, h_0(\theta_2)) \right), \tag{44}$$

where $c$ is defined in (17). Due to the Jensen inequality for the Shannon entropy, we can easily see that $\delta > 0$. Taking into account the definition of a universal code and the ergodicity of $\theta_1, \theta_2$, we obtain that

$$(t_1 + t_2)\, h^*_0(x^1 \diamond x^2) - (|\varphi(x^1)| + |\varphi(x^2)|) = (t_1 + t_2) \Big( h_0\Big(\frac{t_1}{t_1 + t_2} \theta_1 + \frac{t_2}{t_1 + t_2} \theta_2\Big) - \Big(\frac{t_1}{t_1 + t_2} h_0(\theta_1) + \frac{t_2}{t_1 + t_2} h_0(\theta_2)\Big) \Big) + o(t_1 + t_2)$$

(with probability 1), if $(t_1 + t_2) \to \infty$. Taking into account the definition (44) and (17) we obtain from the last equality that

$$(t_1 + t_2)\, h^*_0(x^1 \diamond x^2) - (|\varphi(x^1)| + |\varphi(x^2)|) \ge \delta (t_1 + t_2) + o(t_1 + t_2).$$

Hence, the difference

$$(t_1 + t_2)\, h^*_0(x^1 \diamond x^2) - (|\varphi(x^1)| + |\varphi(x^2)|)$$

goes to infinity and the second statement of the theorem is proven.

Proof of Theorem 5. The following chain proves the first statement of the theorem:

$$\Pr\{H_0^{\aleph}(\Lambda) \text{ is rejected} \mid H_0 \text{ is true}\} = \Pr\Big\{\bigcup_{i=1}^{\infty} \{H_0^{\aleph}(\Lambda_i) \text{ is rejected} \mid H_0 \text{ is true}\}\Big\} \le \sum_{i=1}^{\infty} \Pr\{H_0^{\aleph}(\Lambda_i) \text{ is rejected} \mid H_0 \text{ is true}\} \le \sum_{i=1}^{\infty} \alpha\, \omega_i = \alpha.$$

(Here both inequalities follow from the description of the test, whereas the last equality follows from (18).)

The second statement also follows from the description of the test. Indeed, let a sample be created by a source $\varrho$ for which $H_1^{\aleph}(\Lambda)$ is true. It is supposed that the sequence of partitions $\Lambda$ discriminates between $H_0^{\aleph}(\Lambda)$ and $H_1^{\aleph}(\Lambda)$. By definition, it means that there exists $j$ for which $H_1^{\aleph}(\Lambda_j)$ is true for the process $\varrho_{\Lambda_j}$. It immediately follows from Theorems 1-4 that the Type II error of the test $T^{\aleph}_\varphi(\Lambda_j, \alpha \omega_j)$ goes to 0 when the sample size tends to infinity.

Proof of Claim 1. From (34) we obtain:

$$-\log K_0(x_1 \dots x_t) = -\log \frac{\Gamma(|A|/2) \prod_{a \in A} \Gamma(\nu_{x_1 \dots x_t}(a) + 1/2)}{\Gamma(1/2)^{|A|}\, \Gamma(t + |A|/2)} = c_1 + c_2 |A| + \log \Gamma(t + |A|/2) - \sum_{a \in A} \log \Gamma(\nu_{x_1 \dots x_t}(a) + 1/2),$$

where $c_1, c_2$ are constants. Now we use the well known Stirling formula

$$\ln \Gamma(s) = \ln \sqrt{2\pi} + (s - 1/2) \ln s - s + \theta/(12 s),$$

where $\theta \in (0, 1)$. Using this formula we rewrite the previous equality as

$$-\log K_0(x_1 \dots x_t) = -\sum_{a \in A} \nu_{x_1 \dots x_t}(a) \log(\nu_{x_1 \dots x_t}(a)/t) + (|A| - 1) \log t/2 + c_1 + c_2 |A|,$$

where $c_1, c_2$ are constants. Taking into account the definition of the empirical entropy, we obtain

$$-\log K_0(x_1 \dots x_t) \le t\, h^*_0(x_1 \dots x_t) + (|A| - 1) \log t/2 + c|A|.$$

Hence,

$$\sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)(-\log K_0(x_1 \dots x_t)) \le t \Big( \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)\, h^*_0(x_1 \dots x_t) \Big) + (|A| - 1) \log t/2 + c|A|.$$

Taking into account the definition of the empirical entropy, we apply the well known Jensen inequality for the concave function $-x \log x$ and obtain the following inequality:

$$\sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)(-\log K_0(x_1 \dots x_t)) \le -t \sum_{a \in A} \Big( \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)\, \frac{\nu_{x_1 \dots x_t}(a)}{t} \Big) \log \Big( \sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)\, \frac{\nu_{x_1 \dots x_t}(a)}{t} \Big) + (|A| - 1) \log t/2 + c|A|.$$

The source $p$ is stationary and ergodic, so the average frequency $\sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)\, \nu_{x_1 \dots x_t}(a)/t$ is equal to $p(a)$ for any $a \in A$, and we obtain from the two last formulas the following inequality:

$$\sum_{x_1 \dots x_t \in A^t} p(x_1 \dots x_t)(-\log K_0(x_1 \dots x_t)) \le t\, h_0(p) + (|A| - 1) \log t/2 + c|A|$$

(where $h_0(p) = -\sum_{a \in A} p(a) \log p(a)$ is the Shannon entropy). Claim 1 is proven.